ML - M1 - INTRODUCTION TO MACHINE LEARNING

Linear Regression requires $g(\cdot)$ to be a linear function of the inputs when predicting the labels:

$$\hat{y} = g(\mathbf{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

▶ $\hat{y}$ is the predicted value of the label.

▶ $n$ is the number of features.

▶ $x_j$ is the $j$-th input feature.

▶ $\theta_k$ is the $k$-th model parameter, including the bias term $\theta_0$.
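
As a concrete illustration, here is a minimal NumPy sketch of this prediction; the parameter and feature values are made up for the example:

```python
import numpy as np

# A minimal sketch of the linear prediction above; the parameter and
# feature values are made up for the example.
theta = np.array([1.0, 0.5, -2.0])   # [theta_0 (bias), theta_1, theta_2]
x = np.array([3.0, 4.0])             # input features x_1, x_2

# y_hat = theta_0 + theta_1 * x_1 + theta_2 * x_2
y_hat = theta[0] + theta[1:] @ x
print(y_hat)                         # 1.0 + 0.5*3.0 - 2.0*4.0 = -5.5
```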

In Machine Learning applications, we usually opt for a generic optimization algorithm that is capable of finding optimal solutions to a wide range of problems: Gradient Descent.
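
A minimal sketch of batch Gradient Descent for Linear Regression under the MSE cost, on synthetic data; the learning rate and iteration count are illustrative choices:

```python
import numpy as np

# Synthetic, noise-free data generated from known parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # 100 samples, n = 2 features
y = 1.0 + 0.5 * X[:, 0] - 2.0 * X[:, 1]       # true parameters: [1.0, 0.5, -2.0]

Xb = np.c_[np.ones(len(X)), X]                # prepend a column of 1s for the bias theta_0
theta = np.zeros(Xb.shape[1])                 # start all parameters at zero
eta = 0.1                                     # learning rate (a hyperparameter)

for _ in range(1000):
    gradient = 2 / len(Xb) * Xb.T @ (Xb @ theta - y)  # gradient of the MSE cost
    theta -= eta * gradient                           # step against the gradient

print(theta)                                  # approaches [1.0, 0.5, -2.0]
```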


To measure the goodness of fit of the model, we can compute the coefficient of determination, or $R^2$:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $\bar{y}$ is the average label in the training sample. A low $R^2$ in the training sample suggests underfitting.
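
Computed directly, with made-up labels and predictions:

```python
import numpy as np

# A minimal sketch of the R^2 formula above; y and y_hat are made-up values.
y = np.array([3.0, -0.5, 2.0, 7.0])        # observed labels
y_hat = np.array([2.5, 0.0, 2.0, 8.0])     # model predictions

ss_res = np.sum((y - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)       # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot
print(r2)                                  # about 0.949
```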

The $R^2$ of the Linear Regression indicates the proportion of the labels' variance explained by the model.

▶ We can also understand the $R^2$ as the prediction gains made by the model over a model that simply uses the sample average to predict the labels. In the training sample (or in-sample), the $R^2$ of the Linear Regression will never be negative (if there is a bias term). In the test sample (or out-of-sample), the $R^2$ of the Linear Regression may be negative:

▶ A model may be arbitrarily bad at making predictions out-of-sample, even worse than a prediction based on the average, as the sketch below illustrates. A potential solution is to increase the complexity of the model.
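
A minimal sketch with synthetic data, using scikit-learn's LinearRegression and r2_score, shows how the out-of-sample $R^2$ turns negative when the test sample behaves differently from the training sample:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic, made-up data: the relation between x and y reverses between
# the training sample and the test sample.
X_train = np.linspace(0, 1, 50).reshape(-1, 1)
y_train = X_train.ravel()                  # in-sample: y grows with x

X_test = X_train.copy()
y_test = -X_test.ravel()                   # out-of-sample: the relation is reversed

model = LinearRegression().fit(X_train, y_train)

print(r2_score(y_train, model.predict(X_train)))  # 1.0: perfect in-sample fit
print(r2_score(y_test, model.predict(X_test)))    # strongly negative: worse than the test average
```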

Gradient Descent is an algorithm that iteratively adjusts the model's weights and biases in the direction that minimizes the cost function.

Different ways to avoid overfitting:

▶ Gather more training data to reduce noise or clean the data to fix errors or remove outliers.

▶ Regularization. Constrain the trainable parameters to remain relatively small.

▶ Early stopping. Introduce a validation sample and stop training as soon as the model starts to generalize poorly (see the sketch after this list).

▶ Tuning of hyperparameters. Determine the combination of hyperparameters that yields the best performance of the trained model on the validation sample.
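
A rough sketch of early stopping applied to the gradient-descent loop from above; the patience threshold, learning rate, and data split are arbitrary illustrative choices:

```python
import numpy as np

# Synthetic, made-up data with some noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

Xb = np.c_[np.ones(len(X)), X]                # prepend 1s for the bias term
X_train, y_train = Xb[:150], y[:150]          # training sample
X_val, y_val = Xb[150:], y[150:]              # held-out validation sample

theta = np.zeros(Xb.shape[1])
best_theta, best_val_mse, patience = theta.copy(), np.inf, 0

for epoch in range(5000):
    grad = 2 / len(X_train) * X_train.T @ (X_train @ theta - y_train)
    theta -= 0.01 * grad
    val_mse = np.mean((X_val @ theta - y_val) ** 2)   # track validation error
    if val_mse < best_val_mse:
        best_val_mse, best_theta, patience = val_mse, theta.copy(), 0
    else:
        patience += 1
        if patience >= 50:       # stop once validation error stops improving
            break

theta = best_theta               # roll back to the best validation model
```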

Hyperparameter: a parameter of the learning algorithm that does not explicitly serve to predict the labels. As such, hyperparameters are held fixed during training.
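
For instance, the regularization strength alpha of scikit-learn's Ridge regression is a hyperparameter: it stays fixed during each training run, and we choose its value on a validation sample. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic, made-up data for illustration.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_alpha, best_r2 = None, -np.inf
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:           # candidate hyperparameter values
    model = Ridge(alpha=alpha).fit(X_train, y_train)  # alpha stays fixed while training
    val_r2 = r2_score(y_val, model.predict(X_val))    # evaluate on the validation sample
    if val_r2 > best_r2:
        best_alpha, best_r2 = alpha, val_r2

print(best_alpha, best_r2)
```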